6 Applications in Computer Vision

6.1 Introduction

In this chapter, we introduce applications of binary neural networks in computer vision and related fields. Specifically, we cover tasks including person re-identification, 3D point cloud processing, object detection, and speech recognition. First, we briefly overview these areas.

6.1.1 Person Re-Identification

A large family of person re-id research focuses on metric learning losses. Some methods combine a verification loss [248] with the identification loss, while others apply a triplet loss with hard-sample mining [41, 203]. Recent efforts employ pedestrian attributes to strengthen supervision and enable multi-task learning [213, 232]. One mainstream approach horizontally splits input images or feature maps to exploit local spatial cues [132, 219, 271]. Similarly, pose estimation has been incorporated into the learning of local features [212, 214]. Furthermore, human parsing is used in [111] to enhance spatial matching. In comparison, our DG-Net relies only on a simple identification loss for re-id learning and requires no extra auxiliary information, such as pose or human parsing, for image generation.
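The horizontal-splitting idea above can be illustrated with a minimal sketch: a convolutional feature map is cut into horizontal stripes, and each stripe is pooled into its own part-level descriptor. This is only an assumption-laden toy (NumPy in place of a deep learning framework, average pooling, six parts as in typical part-based models); real part-based re-id models additionally learn a classifier per part.

```python
import numpy as np

def part_pooling(feature_map, num_parts=6):
    """Split a C x H x W feature map into horizontal stripes and
    average-pool each stripe into one part-level descriptor.
    Illustrative sketch only, not any specific published model."""
    c, h, w = feature_map.shape
    assert h % num_parts == 0, "height must divide evenly in this sketch"
    # group rows into num_parts stripes of equal height
    stripes = feature_map.reshape(c, num_parts, h // num_parts, w)
    # average over each stripe's spatial extent -> (num_parts, C)
    return stripes.mean(axis=(2, 3)).T

feats = part_pooling(np.random.rand(256, 24, 8), num_parts=6)
print(feats.shape)  # (6, 256)
```

Each of the six 256-d rows then serves as a local descriptor for one horizontal body region, which is how local spatial cues enter the matching.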

Another active line of research uses GANs [76] to augment training data. The method in [294] is the first to use an unconditional GAN to generate pedestrian images from random vectors. Huang et al. proceed in this direction with WGAN [4] and assign pseudo-labels to the generated images [95]. Li et al. propose sharing weights between the re-id model and the discriminator of the GAN [76]. In addition, some recent methods use pose estimation to generate pose-conditioned images. In [103], a two-stage generation pipeline refines the generated images based on pose. Similarly, pose is used in [71] to generate images of a pedestrian in different poses, making the learned features more robust to pose variation.

Meanwhile, some recent studies exploit synthetic data for style transfer of pedestrian images to compensate for the disparity between source and target domains. CycleGAN [300] is applied in [296] to transfer the style of pedestrian images from one dataset to another. StarGAN [44] is used in [295] to generate pedestrian images in different camera styles. Bak et al. [7] employ a game engine to render pedestrians under various illumination conditions. Wei et al. [241] use semantic segmentation to extract foreground masks that assist the style transfer.

6.1.2 3D Point Cloud Processing

PointNet [192] is the first deep learning model that directly processes point clouds. The basic building blocks proposed by PointNet, such as multi-layer perceptrons for point-wise feature extraction and max/average pooling for global aggregation, have become popular design choices for many newer backbones. PointNet++ [193] exploits the met-
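The two building blocks named above, a shared point-wise MLP followed by a symmetric pooling operation, can be sketched as follows. This is a hedged NumPy toy with randomly initialized weights, not PointNet's actual architecture (which stacks more layers and transform networks); it only demonstrates why max pooling makes the global feature invariant to the ordering of input points.

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_mlp(points, w1, w2):
    """Apply the same two-layer MLP (with ReLU) to every point
    independently: point-wise feature extraction."""
    h = np.maximum(points @ w1, 0.0)
    return np.maximum(h @ w2, 0.0)

def global_feature(points, w1, w2):
    """Point-wise MLP followed by max pooling over the point axis.
    Max is a symmetric function, so the output does not depend on
    the order in which points are listed."""
    return shared_mlp(points, w1, w2).max(axis=0)

pts = rng.normal(size=(1024, 3))           # N x 3 point cloud
w1 = rng.normal(size=(3, 64))              # toy random weights
w2 = rng.normal(size=(64, 128))
g = global_feature(pts, w1, w2)            # 128-d global descriptor
# permutation invariance: shuffling the points leaves g unchanged
g_shuf = global_feature(pts[rng.permutation(1024)], w1, w2)
assert np.allclose(g, g_shuf)
```

The permutation check at the end is the key property: because pooling aggregates over an unordered set, any point ordering yields the same global descriptor.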

DOI: 10.1201/9781003376132-6
